Abstract

Language modeling is a difficult problem for languages with
rich morphology. In this paper we investigate the use of
morphology-based language models at different stages in a
speech recognition system for conversational Arabic. Class-based
and single-stream factored language models using morphological
word representations are applied within an N-best list rescoring
framework. In addition, we explore the use of factored language
models in first-pass recognition, which is facilitated by two
novel procedures: the data-driven optimization of a multi-stream
language model structure, and the conversion of a factored
language model to a standard word-based model. We evaluate these
techniques on a large-vocabulary recognition task and demonstrate
that they lead to perplexity and word error rate reductions.


1.  Introduction 

A standard statistical language model (LM) computes the
probability of a word sequence $W = w_1, w_2, \ldots, w_T$ as a
product of the conditional probabilities of each word $w_t$ given
its history, which is typically approximated by the one or two
most recent words. Even with this limitation, the estimation of
LM probabilities is challenging since many word contexts are
observed infrequently or not at all. This is particularly
problematic for morphologically rich languages, e.g. Turkish,
Russian, or Arabic. Such languages have a high vocabulary growth
rate, which results in high language model perplexity and a large
number of out-of-vocabulary (OOV) words (see e.g. [1, 2, 3, 4, 5]).
Recently, Factored Language Models (FLMs) [5, 6] have been
developed to address this problem. FLMs decompose words into a
number of features and use the resulting representation in a
generalized backoff scheme that improves the robustness of
probability estimates for rarely observed word n-grams. A
straightforward way to use FLMs and other morphology-based LMs in
automatic speech recognition (ASR) is to rescore N-best lists
and combine scores from different models for final hypothesis
selection. Here, we present results using this technique as well
as two extensions to this approach: (a) the automatic
optimization of FLM design parameters using a data-driven
procedure, and (b) the use of FLMs in first-pass recognition
rather than rescoring.


2.  Factored  Language  Models 

FLMs decompose each word $w$ into a set of $k$ features (or
factors), i.e. $w \equiv f^{1:k}$. Factors represent
morphological, syntactic, or semantic word information and can be
e.g. stems, POS tags, etc. in addition to the words themselves.
Probabilistic LMs are then constructed over (sub)sets of factors.
Using a trigram approximation, this can be expressed as:

$$p(f_1^{1:k}, f_2^{1:k}, \ldots, f_T^{1:k}) \approx \prod_{t=3}^{T} p(f_t^{1:k} \mid f_{t-1}^{1:k}, f_{t-2}^{1:k}) \quad (1)$$

Each word is dependent not only on a single stream of temporally
preceding word variables, but also on additional parallel
streams of features. Such a representation can be used to back
off to factors when the word n-gram has not been observed in
the training data, thus improving probability estimates. For
instance, a word bigram may not have any counts in the training
set, but its corresponding factor combinations (e.g. stems and
other morphological tags) may have been observed since they
also occur in other words. This is achieved via a new generalized
parallel backoff technique. In standard Katz-style backoff,
the maximum-likelihood estimate of an n-gram with too few
observations in the training data is replaced with a probability
derived from the lower-order (n-1)-gram and a backoff weight
as follows:

$$p_{BO}(w_t \mid w_{t-1}, w_{t-2}) = \begin{cases} d_c \, p_{ML}(w_t \mid w_{t-1}, w_{t-2}) & \text{if } c > \tau_3 \\ \alpha(w_{t-1}, w_{t-2}) \, p_{BO}(w_t \mid w_{t-1}) & \text{otherwise} \end{cases} \quad (2)$$

where $c$ is the count of $(w_t, w_{t-1}, w_{t-2})$, $p_{ML}$
denotes the maximum-likelihood estimate, $d_c$ is a discounting
factor, $\tau_3$ is a count threshold, and
$\alpha(w_{t-1}, w_{t-2})$ is a normalization factor. During
standard backoff, the most distant conditioning variable (in this
case $w_{t-2}$) is dropped first, followed by the second most
distant variable etc., until the unigram is reached. This can be
visualized as a backoff path (Figure 1(a)). If additional
conditioning variables are used which do not form a temporal
sequence, it is not immediately obvious in which order they
should be dropped. In this case, several backoff paths are
possible, which can be summarized in a backoff graph
(Figure 1(b)). Paths in this graph can be chosen in advance based
on linguistic knowledge, or at run-time based on statistical
criteria such as counts in the training set. It is also possible
to choose multiple paths and combine their probability estimates.

Figure 1: Standard backoff path for a 4-gram language model
over words (left) and backoff graph for 4-gram over factors
(right).

The use of multiple conditioning factors is similar to the
procedure described in [7] but is more general in that it allows
arbitrary backoff paths instead of imposing an a priori ordering
of more specific to more general probability distributions.
Moreover, it provides different combination methods for
probability estimates obtained from different paths. This is
achieved by replacing the backed-off probability $p_{BO}$ in
Equation 2 by a general function $g$, which can be any
non-negative function applied to the counts of the lower-order
n-gram. Several different $g$ functions can be chosen, e.g. the
mean, weighted mean, product, minimum or maximum of the smoothed
probability distributions over all subsets of conditioning
factors [5]. In addition to different choices for $g$, different
discounting parameters can be selected at different levels in the
backoff graph. For instance, at the topmost node, Kneser-Ney
discounting might be used whereas at a lower node Good-Turing
might be applied. FLMs have been implemented as an add-on to the
widely-used SRILM toolkit. Further details can be found in [5].
One difficulty in training FLMs is finding the best combination
of design choices, in particular the conditioning factors,
backoff path(s) and smoothing options. Since the space of
different combinations is too large to be searched exhaustively,
we have developed an automatic procedure to optimize FLMs,
further described in Section 4.2.
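
To make the size of this design space concrete, the following Python
sketch enumerates a backoff graph in which every node drops exactly one
conditioning factor at a time. It is an illustration only, not the SRILM
implementation; the factor names `W-1`, `S-1` and `M-1` (previous word,
stem and morphological class) are hypothetical labels chosen for the
example.

```python
from itertools import combinations, permutations


def backoff_graph(factors):
    """Map each node (a frozenset of conditioning factors) to its children.

    A child is obtained by dropping exactly one factor from its parent,
    so a node with m factors has m children with m - 1 factors each.
    """
    graph = {}
    for size in range(len(factors), -1, -1):
        for node in combinations(factors, size):
            parent = frozenset(node)
            graph[parent] = [parent - {f} for f in node]
    return graph


def backoff_paths(factors):
    """List every path from the full context down to the empty context.

    Each path is written as the order in which the factors are dropped.
    """
    return list(permutations(factors))


# Hypothetical conditioning factors: previous word, stem, and morph class.
factors = ("W-1", "S-1", "M-1")
graph = backoff_graph(factors)
paths = backoff_paths(factors)

print(len(graph))  # number of nodes, one per subset of the factors
print(len(paths))  # number of distinct single-path backoff orders
```

With three conditioning factors the graph has one node per subset of
factors and one path per ordering of the factors, so the number of
paths grows factorially with the number of factors. This is before
smoothing options or multi-path combinations are taken into account.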

3.  Data  and  Baseline  System 

The experiments reported here were run on the LDC CallHome
corpus of Egyptian Colloquial Arabic (ECA). The training set
consists of the training, hub5_new and eval96 subsets and
contains 120 conversations (~180K words) in total. The
development set (dev) has 32K words and the two test sets have
18K (eval97) and 11K (eval03) words, respectively. The recognizer
was trained on the 'romanized' transcriptions of the data. Of all
word tokens, 9% are disfluencies and 1.6% are foreign words.
The recognition dictionary consisted of 18K words.

For recognition we use the SRI DECIPHER(TM) system.
The front-end consists of 52 mel-frequency cepstral coefficients
(13 base coefficients + 1st + 2nd + 3rd differences), reduced
with HLDA to 39 dimensions. Mean and variance as well as
vocal tract length normalization are performed for speaker
clusters (the waveforms for each conversation side were clustered
into an average of 3 speaker clusters). Continuous-density,
genonic hidden Markov models [8] with 128 Gaussians per genone
are used. The system contains approximately 220 genones. The
decoder uses a multipass approach: in the first pass (Stage 1),
N-best hypotheses are generated using phoneloop-adapted
non-crossword models and a bigram LM. Maximum word posterior
hypotheses are obtained using N-best ROVER, which are then used
to train speaker-adaptive training (SAT) and maximum-likelihood
linear regression (MLLR) transforms for each speaker. The adapted
models are used in the second pass to produce bigram lattices.
The lattices are rescored with a trigram LM and are used as
recognition networks for the following passes. Two more passes
are performed, one using adapted non-crossword
maximum-mutual-information (MMI) trained models, and one using
adapted crossword maximum-likelihood trained models. Thus we
obtain two sets of N-best hypotheses, each of which is rescored
with additional morphology-based LMs as described below
(Stages 2a and 2b). The final hypotheses are generated by 2-way
N-best ROVER (Stage 3).

4.  ASR  Using  Morphology-based  LMs 

Morphological information for language modeling is obtained
by extracting the stem and the morphological class for each
word from the CallHome ECA lexicon, and by using a morphological
analyzer [9] to further decompose the stem into a root
(typically a sequence of three consonants) and a pattern (a
sequence of consonant and vowel slots) (cf. [5]). Root and
pattern decomposition is noisy since the analyzer was developed
for a different dialect of Arabic.

4.1.  Fixed  FLM  Topologies 

In the system submitted for the RT-03 benchmark evaluations
[10], we used morphology-based language models to rescore
the N-best lists prior to applying ROVER. The factors considered
for LM training were: root, stem, and morphological class.
For each of the two sets of N-best lists, a different
combination of rescoring LMs was employed. The first used three
class-based LMs, where the classes were defined based on each of
the above-mentioned factors. The second used three FLMs, each
with a fixed backoff path allowing backoff only to a single
factor. This led to word error rate reductions on the eval03 set
of 0.8% and 1.5% (absolute), respectively. In the final 2-way
ROVER combination pass we obtained improvements of 1.3%
and 0.8% on the dev and eval03 test sets, respectively (see
"N-best" columns in Table 3).
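
As a rough illustration of N-best rescoring with several language
models, the Python sketch below combines an acoustic log-score with a
weighted sum of LM log-probabilities and picks the best hypothesis. The
data layout, the LM names (`word`, `stem`), the weights and all scores
are invented for the example; the actual system additionally uses
N-best ROVER, which is not modeled here.

```python
def rescore_nbest(nbest, lm_weights, acoustic_weight=1.0):
    """Return the hypothesis with the highest combined log-score.

    nbest: list of dicts with keys "words", "acoustic" (a log-score) and
           "lm" (a dict mapping LM name to log-probability).
    lm_weights: dict mapping LM name to its combination weight.
    """
    def total(hyp):
        lm_score = sum(w * hyp["lm"][name] for name, w in lm_weights.items())
        return acoustic_weight * hyp["acoustic"] + lm_score

    return max(nbest, key=total)


# Toy N-best list with made-up scores (all values are hypothetical).
nbest = [
    {"words": "hyp A", "acoustic": -100.0, "lm": {"word": -18.0, "stem": -15.0}},
    {"words": "hyp B", "acoustic": -101.0, "lm": {"word": -19.5, "stem": -4.0}},
]

word_only = rescore_nbest(nbest, {"word": 1.0})
combined = rescore_nbest(nbest, {"word": 1.0, "stem": 0.5})
print(word_only["words"], combined["words"])
```

In this toy example the word LM alone prefers the first hypothesis,
while adding the stem-based score changes the selection to the second.
In practice the combination weights would be tuned on held-out data.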

4.2.  Automatic  FLM  Parameter  Search 

Since  the  space  of  possible  FLM  structures  is  very  large  we 
explored  the  use  of  Genetic  Algorithms  (GAs)  to  optimize  the 
choice  of  conditioning  factors,  backoff  paths,  and  smoothing 
options.  GAs  [11]  encode  problem  solutions  as  strings  (genes), 
and  evolve  successive  populations  of  solutions  through  the  use 
of  genetic  operators  (e.g.  selection,  cross-over,  mutation).  The 
probability  of  each  gene’s  survival  is  dependent  on  its  fitness 
function,  which  represents  the  desired  optimization  criterion. 
In  this  case,  each  gene  represents  an  FLM  with  a  specific  set 
of  parameters,  i.e.  the  initial  conditioning  factors,  the  backoff 
graph,  and  the  smoothing  options.  The  fitness  function  is  the 
FLM’s  perplexity  on  the  development  set. 

The main problem in applying GAs to our current task
is to find a good encoding of the problem. The initial
set of conditioning factors $F$ is encoded as a binary
string. For instance, a trigram for a word representation with
three factors (A, B, C) has six potential conditioning variables:
$\{A_{-1}, B_{-1}, C_{-1}, A_{-2}, B_{-2}, C_{-2}\}$, which can
be represented as a 6-bit binary string, with a bit set to 1
indicating presence and 0 indicating absence of a factor in $F$.
The string 100011 would correspond to
$F = \{A_{-1}, B_{-2}, C_{-2}\}$. The backoff
graph is encoded by means of graph grammar rules (similar
to [12]), since a direct approach encoding every edge as a bit
would result in overly long strings and inefficient GA search.
(There are up to $m!$ backoff paths for an FLM with $m$ initial
factors.) The grammar rules capture the regularity that a node
with $m$ factors can only back off to children nodes with $m-1$
factors. For instance, for $m = 3$, the choices for proceeding to
the next-lower level in the backoff graph can be described by the
following grammar rules:


RULE 1: $\{x_1, x_2, x_3\} \rightarrow \{x_1, x_2\}$

RULE 2: $\{x_1, x_2, x_3\} \rightarrow \{x_1, x_3\}$

RULE 3: $\{x_1, x_2, x_3\} \rightarrow \{x_2, x_3\}$

Here $x_i$ corresponds to the factor at the $i$th position in the
parent node. Rule 1 indicates a backoff that drops the third
factor, Rule 2 drops the second factor, etc. The choice of rules
used to generate the backoff graph is encoded in a binary string,
with 1 indicating the use and 0 indicating the non-use of a rule.
The backoff graph grows according to the rules specified by the
gene, as shown schematically in Figure 2. The smoothing options
are encoded as tuples of integers, each specifying the
discounting method and backoff threshold at a node in the graph.
Finally, the GA operators are applied to concatenations of all
three substrings describing the set of factors, backoff graph,
and smoothing options, such that all parameters are optimized
jointly. Table 1 lists the perplexities of the word-based n-grams
and of the best FLMs obtained by a manual parameter search,
random search, and the GA-based search. We observe that the GA
procedure leads to 3% (bigram) and 6% (trigram) relative
reductions in perplexity and performs better than either manual
or random search. Models were optimized to reduce the
perplexity on the known words, ignoring the probability given to
OOV words. This constraint prevents the GA from minimizing
perplexity by choosing models which assign high probability
to OOV words rather than to words present in the recognition
dictionary. The best-performing FLMs use all morphological
factors (stems, morph classes, roots and patterns in addition to
words) and parallel backoff with different smoothing options at
different nodes in the backoff graph.
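
The two binary encodings described above can be sketched in Python as
follows. This is a minimal illustration under stated assumptions, not
the authors' GA code: the function names, the tuple representation of
nodes and the rule table are chosen for the example, and the smoothing
options are left out. The factor string is written with all six bits.

```python
def decode_factors(bits, candidates):
    """Select the conditioning factors whose bit is set to 1."""
    if len(bits) != len(candidates):
        raise ValueError("need one bit per candidate factor")
    return [c for b, c in zip(bits, candidates) if b == "1"]


# Production rules for up to three factors: (parent size, positions kept).
RULES = [
    (3, (0, 1)),  # rule 1: {x1, x2, x3} -> {x1, x2}
    (3, (0, 2)),  # rule 2: {x1, x2, x3} -> {x1, x3}
    (3, (1, 2)),  # rule 3: {x1, x2, x3} -> {x2, x3}
    (2, (0,)),    # rule 4: {x1, x2} -> {x1}
    (2, (1,)),    # rule 5: {x1, x2} -> {x2}
]


def grow_backoff_graph(root, gene):
    """Expand the root node with every production rule the gene activates.

    Returns a dict mapping each node (a tuple of factors) to its children.
    """
    active = [rule for bit, rule in zip(gene, RULES) if bit == "1"]
    graph, frontier = {}, [tuple(root)]
    while frontier:
        node = frontier.pop()
        if node in graph:
            continue
        children = [tuple(node[i] for i in keep)
                    for size, keep in active if size == len(node)]
        graph[node] = children
        frontier.extend(children)
    return graph


candidates = ["A-1", "B-1", "C-1", "A-2", "B-2", "C-2"]
chosen = decode_factors("100011", candidates)
graph = grow_backoff_graph(["A", "B", "C"], "10110")

print(chosen)
for node, children in graph.items():
    print(node, "->", children)
```

For the gene 10110 only rules 1, 3 and 4 are active, so the root backs
off to its first two and last two factors, and each two-factor node
backs off to its first factor. Nodes with a single factor have no
applicable rule in this sketch; the final step to the unigram is left
implicit.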


[Figure 2. Panel (a), "Gene activates production rules": the gene
10110 is applied to the production rules

1. {X1 X2 X3} -> {X1 X2}
2. {X1 X2 X3} -> {X1 X3}
3. {X1 X2 X3} -> {X2 X3}
4. {X1 X2} -> {X1}
5. {X1 X2} -> {X2}

and activates rules 1, 3, and 4. Panel (b), "Generation of Backoff
Graph by rules 1, 3, and 4".]

Figure 2: Generation of Backoff Graph from production rules
selected by the gene 10110.


Model     word    FLM manual   FLM rand   FLM GA
bigram    229.9   227.1        229.6      223.2
trigram   229.9   230.3        226.1      212.6

Table 1: Bigram and trigram perplexities on the CH dev set for
word-based LMs and for FLMs obtained by manual, random
(rand) and genetic search (GA).
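
The optimization criterion mentioned in Section 4.2, perplexity on
known words only, can be sketched in Python as below. This is an
assumption-laden illustration rather than the toolkit's actual
computation: the function name and its inputs (per-token natural-log
probabilities and a vocabulary set) are hypothetical, and the example
probabilities are made up.

```python
import math


def known_word_perplexity(tokens, log_probs, vocabulary):
    """Perplexity over in-vocabulary tokens only.

    tokens: the evaluation text as a list of words.
    log_probs: natural-log probability the model assigned to each token.
    vocabulary: set of words in the recognition dictionary.
    """
    known = [lp for tok, lp in zip(tokens, log_probs) if tok in vocabulary]
    if not known:
        raise ValueError("no in-vocabulary tokens to score")
    return math.exp(-sum(known) / len(known))


# Toy example: the third token is OOV and is ignored by the measure.
tokens = ["a", "b", "zzz"]
log_probs = [math.log(0.5), math.log(0.125), math.log(0.9)]
ppl = known_word_perplexity(tokens, log_probs, {"a", "b"})
print(ppl)
```

Because the OOV token is excluded, a model cannot lower this measure by
shifting probability mass toward unknown words, which is the effect the
constraint is meant to prevent.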


Set     dev         eval97      eval03
n       2     3     2     3     2     3
I       230   227   227   222   132   123
II      223   213   222   209   136    89
III     250   227   249   225   145   141
IV      226   217   225   215   137   137

Table 2: Bigram and trigram perplexities obtained by: the
word-based baseline LM (I), the FLM (II), the baseline LM rescored
with the FLM without adding additional n-grams (III), and with
added n-grams (IV), on the different CH sets.


4.3.  Converting  FLMs  to  Word-based  LMs 

Since promising results were obtained by applying morphological
knowledge during rescoring, we expect to gain a further
improvement when applying it at earlier recognition passes.
Better hypotheses at early passes can positively influence
adaptation and re-recognition results at later passes. However,
the use of FLMs in first-pass recognition is problematic because
standard word-based decoders cannot process the decomposed word
representations required by FLMs. For this reason we use a novel
feature of the SRILM toolkit that allows us to 'rescore' a
word-based language model with an FLM. First, the entries in the
word-based LM are converted to a factored representation based
on a lexicon. They are then passed through the FLM trained
on the decomposed training text and are assigned new
probabilities from this FLM. After renormalization, the entries
are converted back to words and written out as a new LM in
standard ARPA format for use with a word-based decoder. When
applied to a development or test set, the rescored word LM
obtains a higher perplexity than the corresponding FLM. This is
because unseen word n-grams in the new text can be assigned
probabilities in the FLM by backing off to previously encountered
factor combinations (e.g. morph class or stem n-grams);
however, if the corresponding word n-grams are not present in
the original word-based LM, they will not be present in the
new rescored LM. For this reason, additional word n-grams
need to be added prior to rescoring in order to derive the
maximum benefit from the FLM. Adding all possible bigrams and
trigrams is clearly infeasible. We select bigrams which do not
exist in the original training data by searching over all possible
bigrams and retaining those for which

$$p_{FLM}(w, h) \left( \log p_{FLM}(w \mid h) - \log p_{word}(w \mid h) \right) > \epsilon$$

where $h$ is the word history, $p_{FLM}$ is the probability
obtained by the original FLM and $p_{word}$ is the probability
obtained by the word LM (cf. [13]). The value of $\epsilon$ was
chosen such that $P_{word}$ would be within 2% of that of the FLM.
Since a comparable search over the entire trigram space is
infeasible, we only search over those trigrams for which both
word bigrams have already been added based on the above
criterion. Table 2 compares the perplexities on the dev and eval
sets obtained by different language models.
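
The bigram selection criterion above can be sketched in Python as
follows. The function name, the callable-based interface and all
probabilities in the example are assumptions made for illustration; in
the actual procedure the probabilities come from the trained FLM and
word LM.

```python
import math


def select_bigrams(candidates, p_flm_joint, p_flm, p_word, epsilon):
    """Keep (h, w) pairs whose weighted log-probability gain exceeds epsilon.

    candidates: iterable of (history, word) pairs.
    p_flm_joint(w, h): joint probability of word and history under the FLM.
    p_flm(w, h), p_word(w, h): conditional probabilities p(w | h).
    """
    selected = []
    for h, w in candidates:
        gain = p_flm_joint(w, h) * (
            math.log(p_flm(w, h)) - math.log(p_word(w, h))
        )
        if gain > epsilon:
            selected.append((h, w))
    return selected


# Toy probability tables keyed by (word, history); all values are made up.
joint = {("a", "h1"): 0.01, ("b", "h1"): 0.0001}
flm = {("a", "h1"): 0.2, ("b", "h1"): 0.02}
word = {("a", "h1"): 0.05, ("b", "h1"): 0.019}

kept = select_bigrams(
    [("h1", "a"), ("h1", "b")],
    lambda w, h: joint[(w, h)],
    lambda w, h: flm[(w, h)],
    lambda w, h: word[(w, h)],
    epsilon=1e-3,
)
print(kept)
```

The weighting by the joint probability means that a bigram is only
added when the FLM both disagrees with the word LM and considers the
bigram likely enough to matter, which keeps the added set small.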

The results show that the use of FLMs (row II) leads to
perplexity reductions on all sets with the exception of the bigram
on the eval03 set. Since reductions are achieved on the eval97
set, it is unlikely that this is due to overfitting to the
development data by the GA search procedure; rather, it seems
that the eval03 set is very different in nature from the other
two sets. This is confirmed by the much lower perplexities and
typical word error rates (around 40%) obtained on this set. The
differences between rows II and III/IV demonstrate the loss in
performance due to the rescoring procedure described above,
which prevents us from exploiting the benefits of FLMs to the
full extent. This is particularly obvious for the trigram applied
to the eval03 set. Since trigrams that are added in IV are
dependent on previously added bigrams, perplexity does not
decrease but increases in this case.

               dev                     eval97                  eval03
Stage   baseline  Nbest  all     baseline  Nbest  all     baseline  Nbest  all
(1)       57.3      -    56.2      61.7      -    61.0      46.7      -    46.3
(2a)      54.8    53.4   52.7      58.2    56.9   56.5      40.8    39.9   40.2
(2b)      54.3    53.0   52.5      58.8    57.9   57.4      41.0    39.5   40.1
(3)       53.9    52.6   52.1      57.6    56.6   56.1      40.2    39.4   39.6

Table 3: Word error rates (in %) obtained by the baseline system
(baseline), the system using morphology-based LMs for N-best list
rescoring (Nbest = FLM Nbest rescoring), and the system using
morphology-based FLM in all recognition passes and the previous
models for N-best rescoring (all = FLM all passes), at different
recognition stages as described in Section 3.

4.4.  Recognition  Experiments 

In order to evaluate the total effect of the morphology-based
LMs in the multipass system, we replaced the standard word-based
LM used in the baseline with Model IV in Table 2.
Recognition results (Table 3) show that the use of
morphology-based LMs reduces WER by 1.8% absolute on the dev set
and 1.5% on the eval97 set. One third of that improvement (0.5%)
is due to the use of morphological knowledge throughout all the
recognition passes.

The last columns in Table 3 show the results on the eval03
set. Here, the application of morphological LMs in the rescoring
pass leads to improvements comparable to those on the dev
and eval97 sets; however, using the rescored LM from the
first pass onward slightly hurts rather than helps performance.
This is most likely due to the increase in perplexity of the
rescored model on this set, as explained above.

5.  Discussion 

We have shown that the use of morphology-based LMs at different
stages in an LVCSR system for Arabic leads to word error
rate reductions. One drawback of the current approach is
that the full potential of the FLM cannot be exploited directly
since factored word representations are not supported by current
decoders. Future work will focus on creating better interfaces
between the decoder and factored language models, and on
extending the current method by adding out-of-vocabulary words
with probabilities assigned by morphological language models.